Backtransform data before mapping statistics #4194

Merged

yutannihilation
Member

@yutannihilation yutannihilation commented Sep 6, 2020

Fix #4155

On #4155 (comment), I wrote

The staged values should bypass the transformation, as they are probably already transformed

but I was wrong. The transformation that the data needs to bypass is not the one after mapping the calculated variables, but the one before mapping. Otherwise, if we do another calculation on the calculated value, the result is still wrong even if it skips the transformation at the end of Layer$map_statistic().

Here's an example. When after_stat is evaluated, the scaled value of x is 4, so x / 2 gives 2, which is placed at 4 on the sqrt scale. But the value of x / 2 in data units should be 8.

d <- data.frame(value = 16)

ggplot(d) +
  geom_point(aes(stage(value, after_stat = x / 2), 0)) +
  scale_x_sqrt()
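The numbers in this example can be checked with a few lines of base R. This is only a sketch of the arithmetic for a sqrt scale, not ggplot2's internals; all variable names are made up for illustration.

```r
# Hypothetical walk-through of the numbers in the example (base R only)
value <- 16

# The sqrt scale transforms the data before the stat runs
x_scaled <- sqrt(value)            # 4

# Current behaviour: x / 2 is evaluated on the scaled value
current_scaled <- x_scaled / 2     # 2
current_data <- current_scaled^2   # read back in data units: 4

# Expected behaviour: x / 2 is evaluated in data units
expected_data <- value / 2         # 8
```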

The transformation we want to bypass is this one. But the operations that come after it (i.e. training and mapping positions) depend on the scaled values, so we cannot skip this transformation.

ggplot2/R/plot-build.r

Lines 60 to 61 in 6b8dba0

# Transform all scales
data <- lapply(data, scales_transform_df, scales = scales)

After some thinking, I came to the conclusion that the values need to be back-transformed before evaluating after_stat. There might be a cleaner mechanism for this, but I believe this is a realistic solution at the moment...
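A minimal sketch of that idea, using plain functions in place of a real scale transformation. The names `trans` and `eval_after_stat()` are hypothetical and do not correspond to ggplot2 functions; the final re-transformation stands in for the one at the end of Layer$map_statistic().

```r
# Stand-in for a scale's transformation: a forward and an inverse function
trans <- list(transform = sqrt, inverse = function(x) x^2)

# Hypothetical helper: evaluate an after_stat-like expression in data units
eval_after_stat <- function(x_scaled, expr, trans) {
  # 1. Back-transform the scaled values to data units
  x_orig <- trans$inverse(x_scaled)
  # 2. Evaluate the user's expression on the back-transformed values
  result <- eval(expr, list(x = x_orig))
  # 3. Transform the result again so later steps still see scaled values
  trans$transform(result)
}

x_scaled <- trans$transform(16)                          # 4
new_x <- eval_after_stat(x_scaled, quote(x / 2), trans)  # sqrt(8)
new_x_data <- trans$inverse(new_x)                       # 8, up to floating point
```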


library(ggplot2)

d <- data.frame(value = 16)

# Current behaviour

ggplot(d) +
  geom_point(aes(value / 2, 0), colour = "green", size = 10) +
  geom_point(aes(stage(value, after_stat = x / 2), 1), colour = "purple", size = 10) +
  scale_x_sqrt(limits = c(0, 16), breaks = c(2, 4, 8))

# This pull request

devtools::load_all("~/GitHub/ggplot2/")
#> ℹ Loading ggplot2

ggplot(d) +
  geom_point(aes(value / 2, 0), colour = "green", size = 10) +
  geom_point(aes(stage(value, after_stat = x / 2), 1), colour = "purple", size = 10) +
  scale_x_sqrt(limits = c(0, 16), breaks = c(2, 4, 8))

Created on 2022-05-15 by the reprex package (v2.0.1)

@clauswilke
Member

Something makes me uncomfortable about this. At a minimum, I think this needs much more documentation to really explain the logic at each step.

@thomasp85
Member

will this not potentially break every plot that has ever used stat() or ..aes..?

I must admit that I think it is better to document what has happened to the data up until the after_stat phase and simply let that be the end of it...

@clauswilke
Member

clauswilke commented Sep 6, 2020

@thomasp85 Something like this PR may very well be needed. I've dealt with a similar problem in coords. However, I think we should do some more due diligence before merging.

It'd be a breaking change, but I think the cases where this would hit would be quite rare, since it requires a combination of using stat() and using a scale with transformation, and the latter is usually applied to positions and the former is usually not. I'm not sure I've ever made a plot that meets this criterion (except my reprex below).

I'm providing another reprex, using the current ggplot2 (without this patch), that shows that the current behavior is confusing and inconsistent with how everything else in ggplot2 works.

library(ggplot2)
library(ggridges)

df <- data.frame(x = rexp(100))

# works as expected
ggplot(df, aes(x, y = 0, fill = stat(x) < 1)) +
  geom_density_ridges_gradient()
#> Picking joint bandwidth of 0.287

# confusing result. why are almost all points green?
ggplot(df, aes(x, y = 0, fill = stat(x) < 1)) +
  geom_density_ridges_gradient() +
  scale_x_log10()
#> Picking joint bandwidth of 0.194

# we have to transform x according to the scale before things work,
# that seems strange from a user's perspective
ggplot(df, aes(x, y = 0, fill = stat(x) < 0)) +
  geom_density_ridges_gradient() +
  scale_x_log10()
#> Picking joint bandwidth of 0.194

Created on 2020-09-06 by the reprex package (v0.3.0)
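The mismatch in the reprex above comes down to which units the comparison runs in. A small base R sketch with fixed values (the vector `x` is made up for illustration) shows why `< 0` on the log10-scaled values matches `< 1` on the data:

```r
# Made-up data values on either side of 1
x <- c(0.1, 0.5, 2, 10)

# What the stat sees under scale_x_log10(): log10-transformed values
x_scaled <- log10(x)

# Comparing scaled values against 1 is really comparing the data against 10
scaled_lt_1 <- x_scaled < 1

# Comparing scaled values against 0 is equivalent to comparing the data against 1
scaled_lt_0 <- x_scaled < 0
data_lt_1 <- x < 1
```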

@thomasp85
Member

I'm not convinced, but I'm also not 100% against. Scale transformations are applied as the very first thing in the data pipeline, which is something there is value in teaching and understanding. It seems weird to me to cherry-pick parts of the data pipeline and undo them at certain points...

Again, not saying this shouldn't be done, but I'd be wary of doing this since, as you note yourself, this is an extreme edge case and the change required is not nothing...

@clauswilke
Member

Another example. This one I find even more disconcerting, because it produces a plot that is objectively wrong.

library(ggplot2)
library(ggridges)

df <- data.frame(x = rexp(100))

ggplot(df, aes(x, y = 0, fill = stat(x))) +
  geom_density_ridges_gradient() +
  scale_x_log10() +
  scale_fill_viridis_c()
#> Picking joint bandwidth of 0.171

Created on 2020-09-06 by the reprex package (v0.3.0)

@thomasp85
Member

Well, it is only objectively wrong because we strip stat() from the guide title 😉

@clauswilke
Member

A possible compromise could be to leave after_stat() as is but add a new function that explicitly backtransforms. E.g. after_stat_bt(). Maybe somebody can come up with a better name.

@thomasp85
Member

I'd much rather explore this (or make it an argument to after_stat()). In general I'd be very skeptical from a performance point of view about transforming all aesthetics back and forth for no reason at all in 99.9% of all plot cases

@yutannihilation
Member Author

yutannihilation commented Sep 6, 2020

I too feel unhappy about this PR, and half the purpose of it is to share the discomfort with you so that you'll come up with some superseding solution, as you always do :)

I've dealt with a similar problem in coords.

Yeah, actually it made me think back-transformation is needed here as well.

In general I'd be very skeptical from a performance point of view about transforming all aesthetics back and forth for no reason at all in 99.9% of all plot cases

Probably we can easily skip the back-transformation by checking if the trans is identity? Besides, the back-transformation only happens when there's some Stat (otherwise it returns early), so I don't think it makes up 99.9%.
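A rough sketch of what such an identity check could look like. The data structures here are hypothetical stand-ins (a named list of lists with a `name` and an `inverse` function), not ggplot2's actual scale objects, and `backtransform_df()` is a made-up helper.

```r
# Hypothetical stand-ins for scales: a transformation name plus its inverse
scales <- list(
  x = list(name = "sqrt", inverse = function(v) v^2),
  y = list(name = "identity", inverse = identity)
)

# Back-transform only the columns whose transformation is not the identity
backtransform_df <- function(df, scales) {
  for (aes_name in intersect(names(df), names(scales))) {
    trans <- scales[[aes_name]]
    # Skip the work entirely when the transformation is the identity
    if (identical(trans$name, "identity")) next
    df[[aes_name]] <- trans$inverse(df[[aes_name]])
  }
  df
}

df <- data.frame(x = c(2, 4), y = c(1, 2))
out <- backtransform_df(df, scales)
```

In this sketch only `x` is touched; `y` passes through unchanged without calling its inverse.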

Whether or not we end up adopting this, I think this is worth a breaking change, as the name after_stat() doesn't sound right, especially now that we have after_scale(), which makes users think after_stat() is before scale. And the actual breakage would be rare.

@clauswilke
Member

Whether or not we end up adopting this, I think this is worth a breaking change, as the name after_stat() doesn't sound right, especially now that we have after_scale(), which makes users think after_stat() is before scale. And the actual breakage would be rare.

I'm not suggesting changing these names again, but I did realize today that I've always been confused by after_scale() because it probably should be called after_mapping(). The data gets transformed (some may think of this as scaling the data), then the statistical transformations are applied, and finally the data gets mapped.

@thomasp85
Member

Besides, the back-transformation only happens when there's some Stat (otherwise it returns early), so I don't think it makes up 99.9%

Sorry - I confused myself as to when this function would get called.

I'm not suggesting changing these names again, but I did realize today that I've always been confused by after_scale() because it probably should be called after_mapping(). The data gets transformed (some may think of this as scaling the data), then the statistical transformations are applied, and finally the data gets mapped.

I'm sure after_mapping() would be more confusing since we use the term "mapping" for a different purpose in the API (assigning data to aesthetics). I feel after_scale() is correct. This step happens after all scaling operations (transformation, censoring, mapping etc.) have completed. I personally don't think these names in any way imply that all operations in the scale or the stat happen in one chunk, which is where I think all this confusion comes from.

@thomasp85
Member

@clauswilke what kind of related issues have you faced in coords?

@clauswilke
Member

For geoms such as geom_hline() or geom_vline(), I needed to backtransform the range to make these geoms work correctly:

ggplot2/R/geom-hline.r

Lines 46 to 48 in ac2b5a7

GeomHline <- ggproto("GeomHline", Geom,
  draw_panel = function(data, panel_params, coord) {
    ranges <- coord$backtransform_range(panel_params)

Getting this all right was quite tricky, because in many places the code wasn't very clear about whether it needed a regular range or a backtransformed range. It took quite a while to disentangle all of this. I tried to document this here:

ggplot2/R/coord-.r

Lines 16 to 30 in ac2b5a7

#' - `backtransform_range(panel_params)`: Extracts the panel range provided
#' in `panel_params` (created by `setup_panel_params()`, see below) and
#' back-transforms to data coordinates. This back-transformation can be needed
#' for coords such as `coord_trans()` where the range in the transformed
#' coordinates differs from the range in the untransformed coordinates. Returns
#' a list of two ranges, `x` and `y`, and these correspond to the variables
#' mapped to the `x` and `y` aesthetics, even for coords such as `coord_flip()`
#' where the `x` aesthetic is shown along the y direction and vice versa.
#' - `range(panel_params)`: Extracts the panel range provided
#' in `panel_params` (created by `setup_panel_params()`, see below) and
#' returns it. Unlike `backtransform_range()`, this function does not perform
#' any back-transformation and instead returns final transformed coordinates. Returns
#' a list of two ranges, `x` and `y`, and these correspond to the variables
#' mapped to the `x` and `y` aesthetics, even for coords such as `coord_flip()`
#' where the `x` aesthetic is shown along the y direction and vice versa.

@yutannihilation
Member Author

I'm sure after_mapping() would be more confusing since we use the mapping term for a different purpose in the API (assigning data to aesthetics). I feel after_scale() is correct.

I agree with this part. While the word "scale" might have different meanings in different places in ggplot2, it probably still has a clearer meaning than "mapping," which is too general.

@thomasp85 thomasp85 added this to the ggplot2 3.4.0 milestone Mar 25, 2021
@thomasp85
Member

Team - should we revive this for the upcoming release? I don't think my stance has changed all that much, but I'd be happy to consider an argument — I don't think a new function is the correct way because there would be no way to port this over to stage()

@clauswilke
Member

I just reread the entire thread and I'm not sure I have a useful opinion either way. Maybe it would be a good idea to bring in @teunbrand, since he filed the original issue that brought this up.

If we want to go forward with this, I think it needs an empirical approach. Implement a version of the suggested idea and see if it breaks things or has performance implications.

@teunbrand
Collaborator

I do have a preference that the stage() family of functions would work as one would intuitively expect. I must admit I'm not 100% familiar with all design decisions involved with map_statistic, but I have an idea that I can work out in a PR if we're to discuss implementation details.

R/layer.r Outdated
@@ -299,6 +299,9 @@ Layer <- ggproto("Layer", NULL,
# evaluation (since the evaluation symbols gets renamed)
data <- rename_aes(data)

# data needs to be non-scaled
data_orig <- scales_backtransform_df(plot$scales, data)
Collaborator

I think you might be able to alleviate Thomas' concern that this is applied to 99% of plots, by executing this line after the early exit at the if (length(new) == 0) return(data) line.

Member Author

Agreed, thanks!

@teunbrand
Collaborator

teunbrand commented May 15, 2022

I think we might even further reduce the impact of this PR on performance if we selectively backtransform the relevant columns, and leave other columns as-is.

# Suppose these are the aesthetics to evaluate
new <- aes(x = after_stat(y + 1), colour = 2, fill = Species)

# Extract all variables occurring in the aesthetics
vars <- unlist(lapply(new, all.vars), use.names = FALSE)

# Only backtransform variables that occur
data_orig <- scales_backtransform_df(scales, df[intersect(names(df), vars)])

Together with Hiroaki's suggestion earlier:

Probably we can easily skip the back-transformation by checking if the trans is identity?

I think the performance hit might be contained to a minimum.

@yutannihilation
Member Author

Thanks!

vars <- unlist(lapply(new, all.vars), use.names = FALSE)

I think this works in most cases, but there's a chance that some variable is referenced indirectly (e.g. get("x")). While that's probably not good practice, I think we should not limit the variables, for safety.
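The limitation is easy to see with base R's all.vars(), which only reports symbols that appear in the expression. The expressions below are made up for illustration:

```r
# A direct reference to `y` is detected
direct <- all.vars(quote(y + 1))

# An indirect reference via get("x") is not: "x" is a string constant here,
# and function names such as `get` are excluded by default
indirect <- all.vars(quote(get("x") + 1))
```

Here `direct` is "y" while `indirect` is an empty character vector, so a column selected this way would miss `x`.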

@yutannihilation
Member Author

If the concern is the performance, I think I addressed it. scales_transform_df() can be left as is, but I added the same workaround to skip non-transforming trans for consistency.

@thomasp85
Member

@yutannihilation is this ready for review?

@yutannihilation
Member Author

Yes.

Member

@thomasp85 thomasp85 left a comment

Can you add a news bullet indicating that this is a breaking change (though quite niche)

otherwise it's good to merge

@yutannihilation
Member Author

Thanks! Let me find some good explanation for NEWS.

@yutannihilation
Member Author

Added a test and a NEWS item.

@yutannihilation yutannihilation merged commit 09ef058 into tidyverse:main Jun 16, 2022
@yutannihilation yutannihilation deleted the fix/issue-4155-backtransform-data branch June 16, 2022 15:38
Development

Successfully merging this pull request may close these issues.

Possible bug in stage()/after_stat() with scale transformations.
4 participants